In market research, obtaining reliable and trustworthy survey data is a challenge for surveyors, because respondents may give subjective or dishonest answers due to conflicts of interest. This project proceeded through audio data preprocessing, exploratory data analysis, feature extraction, modelling, model evaluation, and conclusion. It can help various departments, including marketing and market research, customer service, Human Resources, recruitment and job interviews, and employee training and development, and it can be implemented in various industries such as telecommunications, internet providers, banking and fintech, home and household appliances, customer service, human resources and outsourcing, travel and hotels, training and education, and call centers.
Communication is defined as the process of understanding and sharing meaning (Pearson & Nelson, 2000). In business, communication is a way to understand interlocutors such as customers, potential customers, conversation partners, and vendors. Especially in market research within the marketing department, the company must gather as much valuable information from customers as possible so that it can grow by understanding the needs and wants of its customers and potential customers.
The company uses the Value Proposition Canvas (VPC) framework for product development so that it can create new products that address customers' needs. This framework helps companies and entrepreneurs solve problems and satisfy customer needs by discovering the customer's pains and identifying the customer's jobs to be done. Therefore, to build a customer job list, the company needs to conduct qualitative or quantitative research.
Using the company's resources and capabilities, this valuable information can be used for market research, campaign analysis, product development, process improvement, service improvement, customer satisfaction, product evaluation, service evaluation, customer behavior analysis, and so on.
The company can obtain various kinds of information, ranging from needs, wants, complaints, reviews, and feedback on products to responses to new campaigns and customer sentiment. Most of the company's data comes from questionnaires, observation, interviews, and social media. Hence, there are two types of data taken from surveys: audio data and text data.
The problem is that the collected data can contain subjective or dishonest answers due to conflicts of interest towards the brand or the surveyor. This impairs the company's ability to conduct market research and to understand its customers well, leading to misleading information, a higher customer churn rate, miscommunication, a poor customer experience, and so on.
To minimize misleading information when the company conducts market research, this project develops a classification model that can classify a person's emotion towards a product, a service, or a specific campaign. The emotions the model classifies are anger, happiness, neutral, and sadness.
This project processes only audio data to classify human emotions. It uses the CREMA-D data set, taken from https://github.com/CheyneyComputerScience/CREMA-D . CREMA-D is a data set of 7,442 original clips from 91 actors: 48 male and 43 female, between the ages of 20 and 74, from a variety of races and ethnicities (African American, Asian, Caucasian, Hispanic, and Unspecified). Actors spoke from a selection of 12 sentences, presented using one of six emotions (Anger, Disgust, Fear, Happy, Neutral, and Sad) and four emotion levels (Low, Medium, High, and Unspecified). However, to build a prototype with limited resources, this project proceeds with only four emotions as target classes and a small number of clips (1,183 clips).
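For illustration, each CREMA-D clip name encodes four metadata fields separated by underscores (actor ID, sentence code, emotion code, emotion level), e.g. 1001_IEO_ANG_HI.wav. A small helper can parse this convention; the function name and field labels below are illustrative, not part of the CREMA-D release:

```python
def parse_crema_filename(filename):
    """Split a CREMA-D clip name into its four metadata fields."""
    stem = filename.rsplit('.', 1)[0]                     # drop the .wav extension
    actor_id, sentence, emotion, level = stem.split('_')  # four underscore-separated codes
    return {'Actor': actor_id, 'Sentence': sentence,
            'Emotion': emotion, 'Emotion_Lvl': level}

parse_crema_filename('1001_IEO_ANG_HI.wav')
# {'Actor': '1001', 'Sentence': 'IEO', 'Emotion': 'ANG', 'Emotion_Lvl': 'HI'}
```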
The data set is in accordance with the business needs of this project because:
The project output is a dashboard with a classification model that can classify four emotions (anger, happiness, sadness, and neutral) in real time. The dashboard provides several features, such as:
Therefore, building an emotion speech recognition project can help the company grow, as follows:
In marketing and market research departments, emotional speech recognition can be used to analyze customer sentiments, feelings, and emotional responses. This model helps companies understand how customers emotionally respond towards a product, a service, or specific campaign and make decisions based on the analysis. The company can gather customer voice data through various channels, such as recorded customer service calls, one-on-one customer interview, focus group discussion recording, or voice recordings uploaded by customers in the form of testimonials or product reviews in social media.
In the customer service department, emotional speech recognition can be used to understand the emotions and feelings of customers during interactions with customer service agents. As a result, companies can respond better and provide appropriate solutions to enhance customer satisfaction, customer experience, and customer loyalty with strategic communication improvement.
In the Human Resources department, emotional speech recognition can be used to monitor and analyze employees’ emotional expressions during meetings, presentations, or team interactions. This information can help managers or HR teams to understand employees’ satisfaction levels, anxiety, or happiness so that the HR teams can take appropriate actions to improve their well-being.
In recruitment and job interviews, emotional speech recognition can assist companies in analyzing the speech and emotional responses of candidates during job interviews, providing additional insights into their personality, interpersonal skills, and cultural fit with the company.
In employee training and development, emotional speech recognition can be used to provide real-time feedback and evaluation on how employees communicate emotionally. This can help to improve communication skills, emotional management, and interpersonal interactions.
This project can be implemented in various industries, such as telecommunications, internet providers, banking and fintech, home and household appliances, customer service, human resources and outsourcing, travel and hotels, training and education, and call centers.
# Base library
import os
import math
# Exploratory data
import pandas as pd
import numpy as np
from collections import Counter
%matplotlib inline
import matplotlib.pyplot as plt
import librosa
import librosa.display
# Playing the audio
from IPython.display import display
import IPython.display as ipd
path = 'data_input/'
# Fetches all filenames in the folder
files = os.listdir(path)
# Create a dictionary of speaker demographics keyed by actor ID (the file name prefix)
speaker_info = {
    '1001': (51, 'Male', 'Caucasian', 'Not_Hispanic'),
    '1002': (21, 'Female', 'Caucasian', 'Not_Hispanic'),
    '1003': (21, 'Female', 'Caucasian', 'Not_Hispanic'),
    '1004': (42, 'Female', 'Caucasian', 'Not_Hispanic'),
    '1005': (29, 'Male', 'African_American', 'Not_Hispanic'),
    '1006': (58, 'Female', 'Caucasian', 'Not_Hispanic'),
    '1007': (38, 'Female', 'African_American', 'Not_Hispanic'),
    '1008': (46, 'Female', 'Caucasian', 'Not_Hispanic'),
    '1009': (24, 'Female', 'Caucasian', 'Not_Hispanic'),
    '1010': (27, 'Female', 'Caucasian', 'Not_Hispanic'),
    '1011': (32, 'Male', 'Caucasian', 'Not_Hispanic'),
    '1012': (23, 'Female', 'Caucasian', 'Not_Hispanic'),
    '1013': (22, 'Female', 'Caucasian', 'Hispanic'),
    '1014': (24, 'Male', 'Caucasian', 'Not_Hispanic'),
    '1015': (32, 'Male', 'African_American', 'Not_Hispanic'),
    '1016': (61, 'Male', 'Caucasian', 'Not_Hispanic'),
    '1017': (42, 'Male', 'Caucasian', 'Not_Hispanic'),
    '1018': (25, 'Female', 'Caucasian', 'Hispanic'),
    '1019': (29, 'Male', 'Asian', 'Not_Hispanic'),
    '1020': (61, 'Female', 'Caucasian', 'Not_Hispanic'),
    '1021': (30, 'Female', 'Caucasian', 'Not_Hispanic'),
    '1022': (22, 'Male', 'Caucasian', 'Not_Hispanic'),
    '1023': (22, 'Male', 'Caucasian', 'Not_Hispanic'),
    '1024': (59, 'Female', 'Caucasian', 'Not_Hispanic'),
    '1025': (48, 'Female', 'Caucasian', 'Not_Hispanic'),
    '1026': (33, 'Male', 'Caucasian', 'Not_Hispanic'),
    '1027': (44, 'Male', 'Caucasian', 'Not_Hispanic'),
    '1028': (57, 'Female', 'Caucasian', 'Not_Hispanic'),
    '1029': (33, 'Female', 'African_American', 'Not_Hispanic'),
    '1030': (42, 'Female', 'African_American', 'Not_Hispanic'),
    '1031': (31, 'Male', 'Caucasian', 'Hispanic'),
    '1032': (30, 'Male', 'African_American', 'Not_Hispanic'),
    '1033': (31, 'Male', 'Caucasian', 'Not_Hispanic'),
    '1034': (74, 'Male', 'Caucasian', 'Not_Hispanic'),
    '1035': (48, 'Male', 'Caucasian', 'Not_Hispanic'),
    '1036': (49, 'Male', 'African_American', 'Not_Hispanic'),
    '1037': (45, 'Female', 'Caucasian', 'Not_Hispanic'),
    '1038': (21, 'Male', 'African_American', 'Not_Hispanic'),
    '1039': (51, 'Male', 'African_American', 'Not_Hispanic'),
    '1040': (42, 'Male', 'Caucasian', 'Not_Hispanic'),
    '1041': (42, 'Male', 'Caucasian', 'Not_Hispanic'),
    '1042': (37, 'Male', 'African_American', 'Not_Hispanic'),
    '1043': (25, 'Female', 'Caucasian', 'Hispanic'),
    '1044': (40, 'Male', 'Caucasian', 'Not_Hispanic'),
    '1045': (22, 'Male', 'Asian', 'Not_Hispanic'),
    '1046': (22, 'Female', 'Caucasian', 'Hispanic'),
    '1047': (22, 'Female', 'Unknown', 'Hispanic'),
    '1048': (38, 'Male', 'Caucasian', 'Hispanic'),
    '1049': (25, 'Female', 'Caucasian', 'Hispanic'),
    '1050': (62, 'Male', 'African_American', 'Not_Hispanic'),
    '1051': (56, 'Male', 'Caucasian', 'Not_Hispanic'),
    '1052': (33, 'Female', 'Caucasian', 'Not_Hispanic'),
    '1053': (35, 'Female', 'Caucasian', 'Not_Hispanic'),
    '1054': (36, 'Female', 'Caucasian', 'Not_Hispanic'),
    '1055': (57, 'Female', 'Caucasian', 'Not_Hispanic'),
    '1056': (52, 'Female', 'African_American', 'Not_Hispanic'),
    '1057': (25, 'Male', 'Caucasian', 'Not_Hispanic'),
    '1058': (36, 'Female', 'Caucasian', 'Not_Hispanic'),
    '1059': (21, 'Male', 'African_American', 'Not_Hispanic'),
    '1060': (28, 'Female', 'African_American', 'Not_Hispanic'),
    '1061': (51, 'Female', 'African_American', 'Not_Hispanic'),
    '1062': (56, 'Male', 'Caucasian', 'Not_Hispanic'),
    '1063': (33, 'Female', 'African_American', 'Not_Hispanic'),
    '1064': (53, 'Male', 'Caucasian', 'Not_Hispanic'),
    '1065': (38, 'Male', 'Caucasian', 'Not_Hispanic'),
    '1066': (25, 'Male', 'Caucasian', 'Not_Hispanic'),
    '1067': (66, 'Male', 'Caucasian', 'Not_Hispanic'),
    '1068': (34, 'Male', 'Caucasian', 'Not_Hispanic'),
    '1069': (27, 'Male', 'Caucasian', 'Not_Hispanic'),
    '1070': (25, 'Male', 'African_American', 'Not_Hispanic'),
    '1071': (41, 'Male', 'Caucasian', 'Not_Hispanic'),
    '1072': (33, 'Female', 'Asian', 'Not_Hispanic'),
    '1073': (24, 'Female', 'African_American', 'Hispanic'),
    '1074': (31, 'Female', 'African_American', 'Not_Hispanic'),
    '1075': (40, 'Female', 'Caucasian', 'Not_Hispanic'),
    '1076': (25, 'Female', 'Caucasian', 'Not_Hispanic'),
    '1077': (20, 'Male', 'Caucasian', 'Not_Hispanic'),
    '1078': (21, 'Female', 'Caucasian', 'Not_Hispanic'),
    '1079': (21, 'Female', 'Caucasian', 'Hispanic'),
    '1080': (21, 'Male', 'African_American', 'Not_Hispanic'),
    '1081': (30, 'Male', 'Asian', 'Not_Hispanic'),
    '1082': (20, 'Female', 'Caucasian', 'Not_Hispanic'),
    '1083': (45, 'Male', 'African_American', 'Not_Hispanic'),
    '1084': (46, 'Female', 'Caucasian', 'Not_Hispanic'),
    '1085': (34, 'Male', 'Asian', 'Not_Hispanic'),
    '1086': (33, 'Male', 'Caucasian', 'Not_Hispanic'),
    '1087': (62, 'Male', 'Caucasian', 'Not_Hispanic'),
    '1088': (23, 'Male', 'African_American', 'Not_Hispanic'),
    '1089': (24, 'Female', 'Caucasian', 'Not_Hispanic'),
    '1090': (50, 'Male', 'Asian', 'Not_Hispanic'),
    '1091': (29, 'Female', 'Asian', 'Not_Hispanic'),
}
# Create a dictionary that maps each file to its speaker's demographics.
# The emotion code (ANG/HAP/NEU/SAD) and intensity code (HI/MD/LO/XX) embedded
# in the file name are decoded later, after the dictionary becomes a dataframe.
conditions = {}
for file in files:
    actor_id = file.split('_')[0]
    age, sex, race, ethnicity = speaker_info[actor_id]
    conditions[file] = {'Age': age, 'Sex': sex, 'Race': race, 'Ethnicity': ethnicity}
# Convert the dictionary to a dataframe
df = pd.DataFrame.from_dict(conditions, orient='index')
audio = df.reset_index()
audio.columns = ['File_Name','Age','Sex','Race','Ethnicity']
# Sort by filename index-0, index-2, and index-3
audio = audio.sort_values(by=['File_Name'])
# Create a new column from splitting the File_Name column for sort value matter
audio[['col0', 'col2', 'col3']] = audio['File_Name'].str.split('_', expand=True)[[0, 2, 3]]
audio = audio.sort_values(by=['col0', 'col2', 'col3'])
audio
| File_Name | Age | Sex | Race | Ethnicity | col0 | col2 | col3 | |
|---|---|---|---|---|---|---|---|---|
| 0 | 1001_IEO_ANG_HI.wav | 51 | Male | Caucasian | Not_Hispanic | 1001 | ANG | HI.wav |
| 1 | 1001_IEO_ANG_LO.wav | 51 | Male | Caucasian | Not_Hispanic | 1001 | ANG | LO.wav |
| 2 | 1001_IEO_ANG_MD.wav | 51 | Male | Caucasian | Not_Hispanic | 1001 | ANG | MD.wav |
| 12 | 1001_WSI_ANG_XX.wav | 51 | Male | Caucasian | Not_Hispanic | 1001 | ANG | XX.wav |
| 3 | 1001_IEO_HAP_HI.wav | 51 | Male | Caucasian | Not_Hispanic | 1001 | HAP | HI.wav |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1181 | 1091_IWW_NEU_XX.wav | 29 | Female | Asian | Not_Hispanic | 1091 | NEU | XX.wav |
| 1176 | 1091_IEO_SAD_HI.wav | 29 | Female | Asian | Not_Hispanic | 1091 | SAD | HI.wav |
| 1177 | 1091_IEO_SAD_LO.wav | 29 | Female | Asian | Not_Hispanic | 1091 | SAD | LO.wav |
| 1178 | 1091_IEO_SAD_MD.wav | 29 | Female | Asian | Not_Hispanic | 1091 | SAD | MD.wav |
| 1179 | 1091_IOM_SAD_XX.wav | 29 | Female | Asian | Not_Hispanic | 1091 | SAD | XX.wav |
1183 rows × 8 columns
# Drop unnecessary columns
audio = audio.drop(columns=['File_Name', 'col0'])
# Create a new column from splitting the col3 column to get Emotion & Emotion Level
audio[['Emotion_Lvl','Format']]=audio['col3'].str.split('.',expand=True)[[0,1]]
# Drop unnecessary columns
audio = audio.drop(columns=['col3','Format'])
# Rename col2 column
audio.rename(columns = {'col2' : 'Emotion'}, inplace = True)
# Reset Index
audio = audio.reset_index()
audio = audio.drop(columns=['index'])
audio
| Age | Sex | Race | Ethnicity | Emotion | Emotion_Lvl | |
|---|---|---|---|---|---|---|
| 0 | 51 | Male | Caucasian | Not_Hispanic | ANG | HI |
| 1 | 51 | Male | Caucasian | Not_Hispanic | ANG | LO |
| 2 | 51 | Male | Caucasian | Not_Hispanic | ANG | MD |
| 3 | 51 | Male | Caucasian | Not_Hispanic | ANG | XX |
| 4 | 51 | Male | Caucasian | Not_Hispanic | HAP | HI |
| ... | ... | ... | ... | ... | ... | ... |
| 1178 | 29 | Female | Asian | Not_Hispanic | NEU | XX |
| 1179 | 29 | Female | Asian | Not_Hispanic | SAD | HI |
| 1180 | 29 | Female | Asian | Not_Hispanic | SAD | LO |
| 1181 | 29 | Female | Asian | Not_Hispanic | SAD | MD |
| 1182 | 29 | Female | Asian | Not_Hispanic | SAD | XX |
1183 rows × 6 columns
audio.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1183 entries, 0 to 1182
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   Age          1183 non-null   int64
 1   Sex          1183 non-null   object
 2   Race         1183 non-null   object
 3   Ethnicity    1183 non-null   object
 4   Emotion      1183 non-null   object
 5   Emotion_Lvl  1183 non-null   object
dtypes: int64(1), object(5)
memory usage: 55.6+ KB
# Change to category data type
audio[['Sex','Race','Ethnicity','Emotion','Emotion_Lvl']] = audio[['Sex','Race','Ethnicity','Emotion','Emotion_Lvl']].astype('category')
audio.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1183 entries, 0 to 1182
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   Age          1183 non-null   int64
 1   Sex          1183 non-null   category
 2   Race         1183 non-null   category
 3   Ethnicity    1183 non-null   category
 4   Emotion      1183 non-null   category
 5   Emotion_Lvl  1183 non-null   category
dtypes: category(5), int64(1)
memory usage: 16.0 KB
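The memory-usage drop reported above (from about 55.6 KB to 16.0 KB) is typical of converting repetitive object columns to the category dtype: pandas stores each distinct label once plus compact integer codes per row. A standalone sketch with synthetic data (not the project's dataframe) illustrates the effect:

```python
import pandas as pd

# A repetitive object column, similar in spirit to the Sex column above
s = pd.Series(['Male', 'Female'] * 10_000)

before = s.memory_usage(deep=True)                    # full string objects per row
after = s.astype('category').memory_usage(deep=True)  # integer codes + 2 labels
print(before, after)  # the categorical version is much smaller
```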
audio.nunique()
Age            38
Sex             2
Race            4
Ethnicity       2
Emotion         4
Emotion_Lvl     4
dtype: int64
Insight:
The nunique() function result above shows that the Age column has 38 unique values, the Sex column has 2, the Race column has 4, the Ethnicity column has 2, the Emotion column has 4, and the Emotion_Lvl column has 4.
audio['Age'].unique()
array([51, 21, 42, 29, 58, 38, 46, 24, 27, 32, 23, 22, 61, 25, 30, 59, 48,
33, 44, 57, 31, 74, 49, 45, 37, 40, 62, 56, 35, 36, 52, 28, 53, 66,
34, 41, 20, 50], dtype=int64)
age=pd.crosstab(index=audio['Age'],
columns='Total')
age.sort_values(by=['Total'],ascending=False).T
| Age | 21 | 25 | 22 | 33 | 42 | 24 | 29 | 30 | 31 | 51 | ... | 41 | 52 | 50 | 49 | 44 | 37 | 35 | 28 | 74 | 58 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| col_0 | |||||||||||||||||||||
| Total | 91 | 91 | 78 | 78 | 65 | 51 | 40 | 39 | 39 | 39 | ... | 13 | 13 | 13 | 13 | 13 | 13 | 13 | 13 | 13 | 12 |
1 rows × 38 columns
Insight:
The unique() function result above shows that the minimum age is 20 and the maximum age is 74. The crosstab() function result above shows that ages 21 and 25 take the first positions with 91 clips each, followed by ages 22 and 33 with 78 clips each; the last position belongs to age 58, with 12 clips.
audio['Sex'].unique()
['Male', 'Female']
Categories (2, object): ['Female', 'Male']
sex=pd.crosstab(index=audio['Sex'],
columns='Total')
sex.sort_values(by=['Total'],ascending=False).plot(kind='bar',rot=0)
<Axes: xlabel='Sex'>
Insight:
The unique() function result above shows that the Sex column has two values, Male and Female.
audio['Race'].unique()
['Caucasian', 'African_American', 'Asian', 'Unknown']
Categories (4, object): ['African_American', 'Asian', 'Caucasian', 'Unknown']
race=pd.crosstab(index=audio['Race'],
columns='Total')
race.sort_values(by=['Total'],ascending=False).plot(kind='bar',rot=0)
<Axes: xlabel='Race'>
Insight:
The unique() function result above shows that the Race column consists of Caucasian, African_American, Asian, and Unknown. The crosstab() function result above shows that the Caucasian race is more frequent than the others, with more than 700 clips, while the Asian and Unknown races are the least frequent, each with fewer than 100 clips.
audio['Ethnicity'].unique()
['Not_Hispanic', 'Hispanic']
Categories (2, object): ['Hispanic', 'Not_Hispanic']
eth=pd.crosstab(index=audio['Ethnicity'],
columns='Total')
eth.sort_values(by=['Total'],ascending=False).plot(kind='bar',rot=0)
<Axes: xlabel='Ethnicity'>
Insight:
The unique() function result above shows that the Ethnicity column consists of Not_Hispanic and Hispanic.
audio['Emotion'].unique()
['ANG', 'HAP', 'NEU', 'SAD']
Categories (4, object): ['ANG', 'HAP', 'NEU', 'SAD']
emo=pd.crosstab(index=audio['Emotion'],
columns='Total')
emo.sort_values(by=['Total'],ascending=False).plot(kind='bar',rot=0)
<Axes: xlabel='Emotion'>
Insight:
The unique() function result above shows that the Emotion column consists of ANG (Anger), HAP (Happiness), NEU (Neutral), and SAD (Sadness).
audio['Emotion_Lvl'].unique()
['HI', 'LO', 'MD', 'XX']
Categories (4, object): ['HI', 'LO', 'MD', 'XX']
lvl=pd.crosstab(index=audio['Emotion_Lvl'],
columns='Total')
lvl.sort_values(by=['Total'],ascending=False).plot(kind='bar',rot=0)
<Axes: xlabel='Emotion_Lvl'>
Insight:
The unique() function result above shows that there are four emotion levels: HI (High), LO (Low), MD (Medium), and XX (Unspecified).

The amplitude envelope is a curve that represents the change in amplitude of an audio signal over time. It provides information about the dynamics of the audio signal and supports the analysis and extraction of sound features that are useful in applications such as speech recognition, music analysis, and other audio processing.
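A minimal sketch of how an amplitude envelope can be computed: take the maximum absolute amplitude within each (possibly overlapping) frame. The frame_size and hop_length values here are illustrative choices, and a synthetic decaying tone stands in for a real clip; for the project's data, signal would come from librosa.load as in the cells below.

```python
import numpy as np

def amplitude_envelope(signal, frame_size=1024, hop_length=512):
    """Maximum absolute amplitude within each frame of the signal."""
    return np.array([
        np.max(np.abs(signal[i:i + frame_size]))
        for i in range(0, len(signal), hop_length)
    ])

# A decaying 440 Hz tone as a stand-in for a loaded clip
sr = 44100
t = np.linspace(0, 1, sr, endpoint=False)
demo = np.sin(2 * np.pi * 440 * t) * np.exp(-3 * t)

env = amplitude_envelope(demo)  # one value per hop; traces the decay over time
```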
# Display Anger with High Emotion Level Audio Player - 1001_IEO_ANG_HI.wav
ipd.Audio("data_input/1001_IEO_ANG_HI.wav")
# Display Anger with Low Emotion Level Audio Player - 1088_IEO_ANG_LO.wav
ipd.Audio("data_input/1088_IEO_ANG_LO.wav")
# Display Anger with Medium Emotion Level Audio Player - 1018_IEO_ANG_MD.wav
ipd.Audio("data_input/1018_IEO_ANG_MD.wav")
# Display Anger with Unspecified Emotion Level Audio Player - 1019_MTI_ANG_XX.wav
ipd.Audio("data_input/1019_MTI_ANG_XX.wav")
FIG_SIZE_L = (10,15)
PATH_L = "data_input/"
files = ["1001_IEO_ANG_HI.wav", "1088_IEO_ANG_LO.wav", "1018_IEO_ANG_MD.wav", "1019_MTI_ANG_XX.wav"]
for item in files:
FILE_PATH_L = PATH_L + item
# load audio file with Librosa
signal, sample_rate = librosa.load(FILE_PATH_L, sr=44100)
# display waveform
plt.figure(figsize=(12, 4))
librosa.display.waveshow(signal, sr=sample_rate, alpha=0.4)
plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.yticks(np.arange(-1, 1.25, 0.5))
plt.title(f"Waveform ({'_'.join(item.split(sep='_')[2:4]).replace('.wav','')})")
plt.show()
Insight:
The plots above provide several insights, as follows:
For the anger emotion at various emotion levels, the amplitude envelope captures the strength and intensity of the sound, which can be used as features in machine learning models to classify anger at various levels. Hence, by using amplitude as a measure of the strength and intensity of the sound, the model can help the company identify customers who dislike or are unsatisfied with the company's product, service, specific campaign, or questionnaire questions when conducting market research.
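An intensity feature of this kind can be sketched as frame-wise RMS energy. librosa provides librosa.feature.rms for this; below is a minimal NumPy equivalent for illustration, with illustrative frame_size and hop_length values:

```python
import numpy as np

def rms_energy(signal, frame_size=1024, hop_length=512):
    """Root-mean-square energy per frame: a compact loudness/intensity feature."""
    return np.array([
        np.sqrt(np.mean(signal[i:i + frame_size] ** 2))
        for i in range(0, len(signal), hop_length)
    ])

# A constant-amplitude signal has constant RMS energy
print(rms_energy(np.ones(4096)))
```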
# Display Happiness with High Emotion Level Audio Player - 1090_IEO_HAP_HI.wav
ipd.Audio("data_input/1090_IEO_HAP_HI.wav")
# Display Happiness with Low Emotion Level Audio Player - 1065_IEO_HAP_LO.wav
ipd.Audio("data_input/1065_IEO_HAP_LO.wav")
# Display Happiness with Medium Emotion Level Audio Player - 1044_IEO_HAP_MD.wav
ipd.Audio("data_input/1044_IEO_HAP_MD.wav")
# Display Happiness with Unspecified Emotion Level Audio Player - 1029_IWL_HAP_XX.wav
ipd.Audio("data_input/1029_IWL_HAP_XX.wav")
FIG_SIZE_H = (10,15)
PATH_H = "data_input/"
files_H = ["1090_IEO_HAP_HI.wav", "1065_IEO_HAP_LO.wav", "1044_IEO_HAP_MD.wav", "1029_IWL_HAP_XX.wav"]
for item in files_H:
FILE_PATH_H = PATH_H + item
# load audio file with Librosa
signal, sample_rate = librosa.load(FILE_PATH_H, sr=44100)
# display waveform
plt.figure(figsize=(12, 4))
librosa.display.waveshow(signal, sr=sample_rate, alpha=0.4)
plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.yticks(np.arange(-1, 1.25, 0.5))
plt.title(f"Waveform ({'_'.join(item.split(sep='_')[2:4]).replace('.wav','')})")
plt.show()
Insight:
The plots above provide several insights, as follows:
For the happiness emotion at various emotion levels, the amplitude envelope captures the strength and intensity of the sound, which can be used as features in machine learning models to classify happiness at various levels. Hence, by using amplitude as a measure of the strength and intensity of the sound, the model can help the company identify customers who love or are satisfied with the company's product, service, specific campaign, or specific questionnaire question when conducting market research.
# Display Sadness with High Emotion Level Audio Player - 1054_IEO_SAD_HI.wav
ipd.Audio("data_input/1054_IEO_SAD_HI.wav")
# Display Sadness with Low Emotion Level Audio Player - 1055_IEO_SAD_LO.wav
ipd.Audio("data_input/1055_IEO_SAD_LO.wav")
# Display Sadness with Medium Emotion Level Audio Player - 1043_IEO_SAD_MD.wav
ipd.Audio("data_input/1043_IEO_SAD_MD.wav")
# Display Sadness with Unspecified Emotion Level Audio Player - 1035_IOM_SAD_XX.wav
ipd.Audio("data_input/1035_IOM_SAD_XX.wav")
FIG_SIZE_S = (10,15)
PATH_S = "data_input/"
files_S = ["1054_IEO_SAD_HI.wav", "1055_IEO_SAD_LO.wav", "1043_IEO_SAD_MD.wav", "1035_IOM_SAD_XX.wav"]
for item in files_S:
    FILE_PATH_S = PATH_S + item
    # load audio file with Librosa
    signal, sample_rate = librosa.load(FILE_PATH_S, sr=44100)
    # display waveform
    plt.figure(figsize=(12, 4))
    librosa.display.waveshow(signal, sr=sample_rate, alpha=0.4)
    plt.xlabel("Time (s)")
    plt.ylabel("Amplitude")
    plt.yticks(np.arange(-1, 1.25, 0.5))
    plt.title(f"Waveform ({'_'.join(item.split(sep='_')[2:4]).replace('.wav','')})")
    plt.show()
Insight:
The plots above suggest the following:
For the sadness emotion at its various levels, the amplitude envelope captures the strength and intensity of the sound and can serve as a feature for machine learning models that classify sadness at different intensity levels. By treating amplitude as a proxy for vocal strength and intensity, the company can use such a model during market research to identify customers who have particular pains in their jobs to be done, or who feel sad about a product, service, specific campaign, or questionnaire question.
# Display Neutral with Unspecified Emotion Level Audio Player - 1091_IWW_NEU_XX.wav
ipd.Audio("data_input/1091_IWW_NEU_XX.wav")
# Display Neutral with Unspecified Emotion Level Audio Player - 1050_IWW_NEU_XX.wav
ipd.Audio("data_input/1050_IWW_NEU_XX.wav")
# Display Neutral with Unspecified Emotion Level Audio Player - 1017_IWW_NEU_XX.wav
ipd.Audio("data_input/1017_IWW_NEU_XX.wav")
# Display Neutral with Unspecified Emotion Level Audio Player - 1001_IWW_NEU_XX.wav
ipd.Audio("data_input/1001_IWW_NEU_XX.wav")
FIG_SIZE_N = (10,15)
PATH_N = "data_input/"
files_N = ["1091_IWW_NEU_XX.wav", "1050_IWW_NEU_XX.wav", "1017_IWW_NEU_XX.wav", "1001_IWW_NEU_XX.wav"]
for item in files_N:
    FILE_PATH_N = PATH_N + item
    # load audio file with Librosa
    signal, sample_rate = librosa.load(FILE_PATH_N, sr=44100)
    # display waveform
    plt.figure(figsize=(12, 4))
    librosa.display.waveshow(signal, sr=sample_rate, alpha=0.4)
    plt.xlabel("Time (s)")
    plt.ylabel("Amplitude")
    plt.yticks(np.arange(-1, 1.25, 0.5))
    plt.title(f"Waveform ({'_'.join(item.split(sep='_')[2:4]).replace('.wav','')})")
    plt.show()
Insight:
The plots above suggest the following:
For the neutral emotion (unspecified level), the amplitude envelope captures the strength and intensity of the sound and can serve as a feature for machine learning models that classify neutral speech. By treating amplitude as a proxy for vocal strength and intensity, the company can use such a model during market research to identify customers who feel neutral about a product, service, specific campaign, or questionnaire question.
The pitch of a sound is determined by the frequency of the vibrations produced by the vocal cords in the larynx, which is then translated into audible sound waves. Measured in Hertz (Hz), pitch refers to the high or low tone of a sound and can be used to differentiate between different sounds, including human speech. This information can provide insights into the emotions, intonation, and even the identity of the speaker. For example, the intensity of emotions expressed in speech can be conveyed through pitch, with high pitch conveying excitement or anxiety and low pitch indicating boredom or depression.
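As a minimal illustration of pitch estimation, the sketch below uses autocorrelation on a synthetic 220 Hz tone; this is a simpler, assumed alternative to the `librosa.piptrack` approach used in the next cell, and the 80–400 Hz search range is an assumption covering typical speaking pitch.

```python
import numpy as np

def autocorr_pitch(signal, sr, fmin=80, fmax=400):
    """Estimate pitch (Hz) as the autocorrelation peak within a plausible vocal lag range."""
    corr = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    lo, hi = sr // fmax, sr // fmin  # lags corresponding to fmax..fmin
    lag = lo + np.argmax(corr[lo:hi])
    return sr / lag

# Synthetic 0.2-second, 220 Hz tone standing in for a voiced speech segment
sr = 22050
t = np.linspace(0, 0.2, sr // 5, endpoint=False)
pitch = autocorr_pitch(np.sin(2 * np.pi * 220 * t), sr)
print(pitch)  # close to 220 Hz
```

A higher estimated pitch would suggest excitement or anxiety, a lower one boredom or depression, as described above.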
# Define directory path and initialize empty dataframe
dir_path = 'data_input/'
df_pitch = pd.DataFrame(columns=['filename', 'pitch_mean', 'pitch_std', 'pitch_min', 'pitch_max'])
# Loop through each file in the directory
for filename in os.listdir(dir_path):
    # Check if the file name starts with a number between 1001 and 1091
    if filename.startswith(tuple([str(i) for i in range(1001, 1092)])):
        # Extract the number from the file name
        number = int(filename.split('_')[0])
        # Load audio file and extract pitch features
        audio_file = os.path.join(dir_path, filename)
        y, sr = librosa.load(audio_file, sr=44100)
        pitches, magnitudes = librosa.piptrack(y=y, sr=sr)
        pitch_mean = np.mean(pitches)
        pitch_std = np.std(pitches)
        pitch_min = np.min(pitches)
        pitch_max = np.max(pitches)
        # Append feature values to the dataframe
        df_pitch = pd.concat([df_pitch, pd.DataFrame({
            'filename': [filename],
            'number': [number],
            'pitch_mean': [pitch_mean],
            'pitch_std': [pitch_std],
            'pitch_min': [pitch_min],
            'pitch_max': [pitch_max]
        })], ignore_index=True)
# Assign the dataframe to a variable
pitch_features = df_pitch
# Sort by filename index-0, index-2, and index-3
pitch_features = pitch_features.sort_values(by=['filename'])
pitch_features[['col0', 'col2', 'col3']] = pitch_features['filename'].str.split('_', expand=True)[[0, 2, 3]]
pitch_features = pitch_features.sort_values(by=['col0', 'col2', 'col3'])
# Drop unnecessary columns
pitch_features = pitch_features.drop(columns=['col0', 'col2', 'col3'])
pitch_features = pitch_features.reset_index(drop=True)
pitch_features=pitch_features.drop(['number'],axis=1)
pitch_features
| | filename | pitch_mean | pitch_std | pitch_min | pitch_max |
|---|---|---|---|---|---|
| 0 | 1001_IEO_ANG_HI.wav | 14.328611 | 169.545547 | 0.0 | 3994.187988 |
| 1 | 1001_IEO_ANG_LO.wav | 9.142864 | 122.549446 | 0.0 | 3994.366699 |
| 2 | 1001_IEO_ANG_MD.wav | 9.704618 | 130.087921 | 0.0 | 3993.887939 |
| 3 | 1001_WSI_ANG_XX.wav | 15.859626 | 189.066589 | 0.0 | 3992.139404 |
| 4 | 1001_IEO_HAP_HI.wav | 16.429127 | 184.560547 | 0.0 | 3965.526123 |
| ... | ... | ... | ... | ... | ... |
| 1178 | 1091_IWW_NEU_XX.wav | 4.151556 | 80.656540 | 0.0 | 3989.319824 |
| 1179 | 1091_IEO_SAD_HI.wav | 3.047120 | 55.561882 | 0.0 | 3826.993896 |
| 1180 | 1091_IEO_SAD_LO.wav | 2.854729 | 54.097889 | 0.0 | 3989.583984 |
| 1181 | 1091_IEO_SAD_MD.wav | 3.207180 | 57.784866 | 0.0 | 3935.103516 |
| 1182 | 1091_IOM_SAD_XX.wav | 1.456717 | 31.115238 | 0.0 | 3892.249023 |
1183 rows × 5 columns
a = pitch_features.iloc[[0,4,9,1,5,10,2,6,11,3,7,12]]
b = pitch_features.iloc[[8,21,34,47]]
pitch_sample = pd.concat([a,b])
pitch_sample = pitch_sample.reset_index(drop=True)
pitch_sample
| | filename | pitch_mean | pitch_std | pitch_min | pitch_max |
|---|---|---|---|---|---|
| 0 | 1001_IEO_ANG_HI.wav | 14.328611 | 169.545547 | 0.0 | 3994.187988 |
| 1 | 1001_IEO_HAP_HI.wav | 16.429127 | 184.560547 | 0.0 | 3965.526123 |
| 2 | 1001_IEO_SAD_HI.wav | 8.878007 | 121.977211 | 0.0 | 3896.472900 |
| 3 | 1001_IEO_ANG_LO.wav | 9.142864 | 122.549446 | 0.0 | 3994.366699 |
| 4 | 1001_IEO_HAP_LO.wav | 9.295731 | 127.963921 | 0.0 | 3965.618408 |
| 5 | 1001_IEO_SAD_LO.wav | 7.366590 | 104.775467 | 0.0 | 3980.411133 |
| 6 | 1001_IEO_ANG_MD.wav | 9.704618 | 130.087921 | 0.0 | 3993.887939 |
| 7 | 1001_IEO_HAP_MD.wav | 11.731209 | 151.486237 | 0.0 | 3990.189697 |
| 8 | 1001_IEO_SAD_MD.wav | 10.933606 | 142.445541 | 0.0 | 3992.952393 |
| 9 | 1001_WSI_ANG_XX.wav | 15.859626 | 189.066589 | 0.0 | 3992.139404 |
| 10 | 1001_TAI_HAP_XX.wav | 11.863870 | 163.126831 | 0.0 | 3988.996338 |
| 11 | 1001_IOM_SAD_XX.wav | 11.345793 | 147.691315 | 0.0 | 3993.049805 |
| 12 | 1001_IWW_NEU_XX.wav | 8.466592 | 120.348625 | 0.0 | 3962.358643 |
| 13 | 1002_IWW_NEU_XX.wav | 5.133738 | 83.827484 | 0.0 | 3992.776123 |
| 14 | 1003_IWW_NEU_XX.wav | 7.740101 | 113.748917 | 0.0 | 3991.625244 |
| 15 | 1004_IWW_NEU_XX.wav | 6.679007 | 100.423401 | 0.0 | 3920.516602 |
Insight:
The data frame above, which contains pitch statistics for different emotions and emotion levels, suggests the following:
These pitch values can serve as reference/training data for a machine learning model in an emotion speech recognition project. By learning the characteristic pitch of each emotion and level, the company can use such a model during market research to identify which customers feel anger, happiness, sadness, or neutrality toward a product, service, specific campaign, or questionnaire question.
Energy in prosodic feature extraction refers to the strength of the sound signal in each frame. For audio signals, the root-mean-square (RMS) value is computed for each frame, and the overall magnitude of the signal indicates how loud it is. Generally, energy is calculated by squaring each sample in the frame and summing the results; in some cases the square root of the sum of squares is also taken, producing an energy value in the same units as amplitude. Energy features are often used to detect aspects of speech such as intensity, emphasis, and rhythm.
In the analysis of prosodic feature extraction for energy, we can obtain some insights or understanding of the characteristics of the analyzed sound, including:
Intensity or loudness: The higher the energy value produced, the louder the sound produced by the speaker.
Emotion or expression: Energy values can provide an idea of the expression or emotion conveyed by the speaker. For example, speakers who are angry or happy tend to have higher energy values.
Physical condition: Significant changes in energy values in a speaker's voice can indicate changes in their physical condition. For example, someone who is sick may have lower energy values in their voice.
Diction / Speech style: A person's speech style can be reflected in the energy values of their voice. For example, someone who tends to speak in a monotone or calm intonation may have lower energy values.
In sound processing and prosodic analysis, energy values are often used as features to identify various sound characteristics such as intonation, accent, tempo, and emotion. Therefore, understanding energy values can help in understanding various aspects of human sound and can be used for various applications such as voice recognition, emotion analysis, and natural language processing.
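To make the square-average-root computation described above concrete, here is a minimal NumPy sketch of frame-wise RMS energy; it is the same quantity `librosa.feature.rms` returns in the cells below, but computed by hand on synthetic loud and quiet tones (stand-ins for high- and low-energy utterances), with assumed frame and hop sizes.

```python
import numpy as np

def frame_rms(signal, frame_size=2048, hop_length=512):
    """RMS energy per frame: square each sample, average, then take the square root."""
    return np.array([
        np.sqrt(np.mean(signal[i:i + frame_size] ** 2))
        for i in range(0, len(signal) - frame_size + 1, hop_length)
    ])

sr = 22050
t = np.linspace(0, 1, sr, endpoint=False)
loud = 0.8 * np.sin(2 * np.pi * 220 * t)   # stand-in for a loud, angry segment
quiet = 0.1 * np.sin(2 * np.pi * 220 * t)  # stand-in for a quiet, sad segment

print(frame_rms(loud).mean(), frame_rms(quiet).mean())
```

The louder tone yields a clearly higher mean RMS, which is the basic signal that the energy plots below exploit to separate intense emotions from subdued ones.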
FIG_SIZE_S = (10,15)
PATH_S = "data_input/"
files_S = ["1001_IEO_ANG_HI.wav", "1088_IEO_ANG_LO.wav", "1018_IEO_ANG_MD.wav", "1019_MTI_ANG_XX.wav"]
for item in files_S:
    FILE_PATH_S = PATH_S + item
    # load audio file with Librosa
    y, sr = librosa.load(FILE_PATH_S, sr=44100)
    # Display RMS Energy
    S, phase = librosa.magphase(librosa.stft(y))
    rms = librosa.feature.rms(S=S)
    fig, ax = plt.subplots(figsize=(15, 6), nrows=2, sharex=True)
    times = librosa.times_like(rms)
    ax[0].semilogy(times, rms[0], label='RMS Energy')
    ax[0].set(xticks=[])
    ax[0].legend()
    ax[0].label_outer()
    librosa.display.specshow(librosa.amplitude_to_db(S, ref=np.max),
                             y_axis='log', x_axis='time', ax=ax[1])
    plt.title(f"log Power spectrogram of ({'_'.join(item.split(sep='_')[2:4]).replace('.wav','')})")
    plt.show()
Insight:
The plots above suggest the following:
This sample of anger audio can serve as reference/training data for a machine learning model in an emotion speech recognition project. By learning the energy/loudness profile and other characteristics of anger, the company can use such a model during market research to identify customers who dislike, or are unsatisfied with, a product, service, specific campaign, or questionnaire question.
FIG_SIZE_S = (10,15)
PATH_S = "data_input/"
files_S = ["1090_IEO_HAP_HI.wav", "1065_IEO_HAP_LO.wav", "1044_IEO_HAP_MD.wav", "1029_IWL_HAP_XX.wav"]
for item in files_S:
    FILE_PATH_S = PATH_S + item
    # load audio file with Librosa
    y, sr = librosa.load(FILE_PATH_S, sr=44100)
    # Display RMS Energy
    S, phase = librosa.magphase(librosa.stft(y))
    rms = librosa.feature.rms(S=S)
    fig, ax = plt.subplots(figsize=(15, 6), nrows=2, sharex=True)
    times = librosa.times_like(rms)
    ax[0].semilogy(times, rms[0], label='RMS Energy')
    ax[0].set(xticks=[])
    ax[0].legend()
    ax[0].label_outer()
    librosa.display.specshow(librosa.amplitude_to_db(S, ref=np.max),
                             y_axis='log', x_axis='time', ax=ax[1])
    plt.title(f"log Power spectrogram of ({'_'.join(item.split(sep='_')[2:4]).replace('.wav','')})")
    plt.show()
Insight:
The plots above suggest the following:
This sample of happiness audio can serve as reference/training data for a machine learning model in an emotion speech recognition project. By learning the energy/loudness profile and other characteristics of happiness, the company can use such a model during market research to identify customers who love, or are satisfied with, a product, service, specific campaign, or questionnaire question.
FIG_SIZE_S = (10,15)
PATH_S = "data_input/"
files_S = ["1054_IEO_SAD_HI.wav", "1055_IEO_SAD_LO.wav", "1043_IEO_SAD_MD.wav", "1035_IOM_SAD_XX.wav"]
for item in files_S:
    FILE_PATH_S = PATH_S + item
    # load audio file with Librosa
    y, sr = librosa.load(FILE_PATH_S, sr=44100)
    # Display RMS Energy
    S, phase = librosa.magphase(librosa.stft(y))
    rms = librosa.feature.rms(S=S)
    fig, ax = plt.subplots(figsize=(15, 6), nrows=2, sharex=True)
    times = librosa.times_like(rms)
    ax[0].semilogy(times, rms[0], label='RMS Energy')
    ax[0].set(xticks=[])
    ax[0].legend()
    ax[0].label_outer()
    librosa.display.specshow(librosa.amplitude_to_db(S, ref=np.max),
                             y_axis='log', x_axis='time', ax=ax[1])
    plt.title(f"log Power spectrogram of ({'_'.join(item.split(sep='_')[2:4]).replace('.wav','')})")
    plt.show()
Insight:
The plots above suggest the following:
This sample of sadness audio can serve as reference/training data for a machine learning model in an emotion speech recognition project. By learning the energy/loudness profile and other characteristics of sadness, the company can use such a model during market research to identify customers who have particular pains in their jobs to be done, or who feel sad about a product, service, specific campaign, or questionnaire question.
FIG_SIZE_S = (10,15)
PATH_S = "data_input/"
files_S = ["1091_IWW_NEU_XX.wav", "1050_IWW_NEU_XX.wav", "1017_IWW_NEU_XX.wav", "1001_IWW_NEU_XX.wav"]
for item in files_S:
    FILE_PATH_S = PATH_S + item
    # load audio file with Librosa
    y, sr = librosa.load(FILE_PATH_S, sr=44100)
    # Display RMS Energy
    S, phase = librosa.magphase(librosa.stft(y))
    rms = librosa.feature.rms(S=S)
    fig, ax = plt.subplots(figsize=(15, 6), nrows=2, sharex=True)
    times = librosa.times_like(rms)
    ax[0].semilogy(times, rms[0], label='RMS Energy')
    ax[0].set(xticks=[])
    ax[0].legend()
    ax[0].label_outer()
    librosa.display.specshow(librosa.amplitude_to_db(S, ref=np.max),
                             y_axis='log', x_axis='time', ax=ax[1])
    plt.title(f"log Power spectrogram of ({'_'.join(item.split(sep='_')[2:4]).replace('.wav','')})")
    plt.show()
Insight:
The plots above suggest the following:
This sample of neutral audio can serve as reference/training data for a machine learning model in an emotion speech recognition project. By learning the energy/loudness profile and other characteristics of neutral speech, the company can use such a model during market research to identify customers who feel neutral about a product, service, specific campaign, or questionnaire question.
The Short-time Fourier Transform (STFT) is a signal processing technique that allows us to examine the frequency characteristics of a signal over time. Its main purpose is to analyze signals that exhibit frequency variations over time, such as speech.
By dividing the signal into short, overlapping time segments using a windowing function, the STFT computes the Fourier Transform for each segment, revealing the frequency components present during that specific interval. By performing this analysis on successive time segments, we can track how the frequency content of the signal evolves over time.
In the context of emotion speech recognition classification, the STFT plays a crucial role. It facilitates the capture and analysis of temporal fluctuations in the frequency content of speech signals. This is highly significant because it enables the identification and differentiation of various emotional states conveyed through speech.
The STFT's significance in emotion speech recognition classification projects lies in its ability to effectively capture and analyze the changing frequency characteristics of speech signals over time. This analysis is instrumental in discerning and distinguishing different emotional states expressed in speech.
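The windowing-plus-FFT procedure described above can be sketched directly in NumPy. This is a hedged, minimal version of what `librosa.stft` computes (without its padding and centering), demonstrated on a synthetic signal whose frequency jumps mid-way, like shifting intonation:

```python
import numpy as np

def simple_stft(signal, n_fft=2048, hop_length=512):
    """Apply a Hann window to each frame, take its FFT, and return the magnitudes."""
    window = np.hanning(n_fft)
    frames = [
        np.fft.rfft(window * signal[i:i + n_fft])
        for i in range(0, len(signal) - n_fft + 1, hop_length)
    ]
    return np.abs(np.array(frames)).T  # shape: (freq_bins, time_frames)

sr = 22050
t = np.linspace(0, 1, sr, endpoint=False)
# frequency jumps from 300 Hz to 600 Hz half-way through the signal
sig = np.concatenate([np.sin(2 * np.pi * 300 * t[:sr // 2]),
                      np.sin(2 * np.pi * 600 * t[sr // 2:])])

spec = simple_stft(sig)
print(spec.shape)  # (n_fft // 2 + 1, num_frames)
```

The dominant frequency bin moves upward between the first and last frames, which is exactly the kind of time-varying frequency content the spectrograms below reveal for emotional speech.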
FIG_SIZE_STFT = (8, 5)
PATH_S = "data_input/"
files_S = ["1082_IEO_ANG_HI.wav", "1047_IEO_HAP_HI.wav", "1049_IEO_SAD_HI.wav", "1001_IWW_NEU_XX.wav", "1017_IWW_NEU_XX.wav"]
for item in files_S:
    FILE_PATH_S = PATH_S + item
    # load audio file with Librosa
    y, sr = librosa.load(FILE_PATH_S)
    # STFT -> spectrogram
    hop_length = 512  # in num. of samples
    n_fft = 2048  # window in num. of samples
    # calculate duration of hop length and window in seconds
    hop_length_duration = float(hop_length) / sr
    n_fft_duration = float(n_fft) / sr
    print("STFT hop length duration is: {} second".format(hop_length_duration))
    print("STFT window duration is: {} second".format(n_fft_duration))
    # perform stft
    stft = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
    # calculate absolute values on complex numbers to get magnitude
    spectrogram = np.abs(stft)
    # apply logarithm to cast amplitude to Decibels
    log_spectrogram = librosa.amplitude_to_db(spectrogram)
    # extract name parts from the file name
    name_parts = item.split('_')
    name_parts_selected = name_parts[2:4]  # select parts at index 2 and 3
    # combine selected name parts with underscore separator
    title = '_'.join(name_parts_selected)
    # display spectrogram
    plt.figure(figsize=FIG_SIZE_STFT)
    librosa.display.specshow(log_spectrogram, sr=sr, hop_length=hop_length, x_axis='time')
    plt.xlabel("Time")
    plt.ylabel("Frequency")
    plt.colorbar()
    plt.title("Spectrogram - {}".format(title))  # plot title with the selected file name
    plt.show()
STFT hop length duration is: 0.023219954648526078 second STFT window duration is: 0.09287981859410431 second
STFT hop length duration is: 0.023219954648526078 second STFT window duration is: 0.09287981859410431 second
STFT hop length duration is: 0.023219954648526078 second STFT window duration is: 0.09287981859410431 second
STFT hop length duration is: 0.023219954648526078 second STFT window duration is: 0.09287981859410431 second
STFT hop length duration is: 0.023219954648526078 second STFT window duration is: 0.09287981859410431 second
Insight:
For anger at a high level, the frequency amplitudes are spread over a wide time range, mostly concentrated between roughly 0 and 2 seconds. This indicates dominant frequency components in that interval that may be relevant to the expression of anger, and this pattern can help the model, which uses STFT analysis of audio signals, identify customers who dislike, or are unsatisfied with, a product, service, specific campaign, or questionnaire question during market research.
For happiness at a high level, the frequency amplitudes are distributed fairly evenly, with the peaks of the dominant frequencies in roughly the first second. This pattern can help the model identify customers who love, or are satisfied with, a product, service, specific campaign, or questionnaire question.
For sadness at a high level, the amplitude-frequency distribution is weak in the 2 - 2.5 second range. This pattern can help the model identify customers who have particular pains in their jobs to be done, or who feel sad about a product, service, specific campaign, or questionnaire question.
For the neutral emotion (unspecified level), the frequency amplitudes show greater variability, because neutral audio has diverse characteristics, particularly in its frequency-amplitude distribution.
In summary, the highest amplitude levels with long, dense stretches correspond to anger; happiness and neutral show low-to-moderate amplitudes with density over intermediate durations; and small amplitude ranges with density over shorter durations indicate sadness.
The Mel-Frequency Cepstral Coefficients (MFCCs) are widely utilized in speech recognition and involve converting the power spectrum of a sound into a Mel-scale, which accounts for the human auditory perception. This enables differentiation of voices based on their distinct frequency ranges. MFCCs, which typically consist of a small set of features (usually around 10-20), succinctly describe the overall shape of the spectral envelope and capture the characteristic properties of the human voice.
The MFCC coefficients effectively capture the essential spectral characteristics of the audio signal, highlighting the perceptually significant components. These coefficients can be employed as features for diverse tasks in audio and speech processing, including speech recognition, speaker identification, and emotion recognition. MFCC provides a concise representation of the spectral content of an audio signal, incorporating both frequency and perceptual characteristics. This makes it a potent tool for the analysis and processing of audio signals.
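For intuition, the full pipeline sketched above (power spectrum → mel filterbank → log → DCT) can be written from scratch in NumPy. This is a simplified educational version with assumed parameter choices (26 mel bands, 13 coefficients, no pre-emphasis or liftering); in practice `librosa.feature.mfcc`, used in the cell below, handles these details.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels=26):
    """Triangular filters spaced evenly on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc(signal, sr, n_fft=2048, hop=512, n_mels=26, n_mfcc=13):
    window = np.hanning(n_fft)
    # power spectrum per frame
    frames = np.array([
        np.abs(np.fft.rfft(window * signal[i:i + n_fft])) ** 2
        for i in range(0, len(signal) - n_fft + 1, hop)
    ])
    # mel-scale the spectrum and take the log
    mel_energy = np.log(frames @ mel_filterbank(sr, n_fft, n_mels).T + 1e-10)
    # DCT-II compacts the log-mel spectrum into a few cepstral coefficients
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_mfcc), 2 * n + 1) / (2 * n_mels))
    return mel_energy @ dct.T  # shape: (time_frames, n_mfcc)

sr = 22050
t = np.linspace(0, 1, sr, endpoint=False)
coeffs = mfcc(np.sin(2 * np.pi * 440 * t), sr)
print(coeffs.shape)  # (num_frames, n_mfcc)
```

Each row is a compact 13-number summary of one frame's spectral envelope, which is what makes MFCCs such convenient inputs for emotion classifiers.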
FIG_SIZE_STFT = (8, 5)
PATH_S = "data_input/"
files_S = ["1082_IEO_ANG_HI.wav", "1047_IEO_HAP_HI.wav", "1049_IEO_SAD_HI.wav", "1001_IWW_NEU_XX.wav", "1017_IWW_NEU_XX.wav"]
for item in files_S:
    FILE_PATH_S = PATH_S + item
    # load audio file with Librosa
    x, sr = librosa.load(FILE_PATH_S, sr=44100)
    mfccs = librosa.feature.mfcc(y=x, sr=sr)
    # Extract file name
    file_name = item.split("_")[2] + "_" + item.split("_")[3]
    # Display the MFCCs with the file name as the plot title
    fig, ax = plt.subplots(figsize=(15, 3))
    img = librosa.display.specshow(mfccs, sr=sr, x_axis='time')
    fig.colorbar(img, ax=ax)
    ax.set(title=file_name)
    plt.show()
FIG_SIZE_STFT = (8, 5)
PATH_S = "data_input/"
files_S = ["1082_IEO_ANG_HI.wav", "1047_IEO_HAP_HI.wav", "1049_IEO_SAD_HI.wav", "1001_IWW_NEU_XX.wav", "1017_IWW_NEU_XX.wav"]
for item in files_S:
    FILE_PATH_S = PATH_S + item
    # load audio file with Librosa
    y, sr = librosa.load(FILE_PATH_S, sr=44100)
    S = librosa.feature.melspectrogram(y=y, sr=sr)
    S_dB = librosa.power_to_db(S, ref=np.max)
    # Extract file name
    file_name = item.split("_")[2] + "_" + item.split("_")[3]
    # Display the Mel-frequency spectrogram with the file name as the plot title
    fig, ax = plt.subplots(figsize=(15, 3))
    img = librosa.display.specshow(S_dB, sr=sr, x_axis='time')
    fig.colorbar(img, ax=ax, format='%+2.0f dB')
    ax.set(title='Mel-frequency spectrogram' + ' ' + file_name)
    plt.show()
Insight:
High sound intensity is exhibited by audio with anger and happiness emotions, followed by neutral emotion, and finally sadness emotion.
This analysis is important because the frequency information contained in the sound can provide valuable clues about the emotions expressed in speech. By analyzing the spectrogram, we can identify patterns of sound intensity that are associated with specific emotions and use this information to classify emotions in unknown audio data.
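To indicate how the extracted features could feed the modelling stage, here is a hypothetical nearest-centroid sketch on synthetic two-dimensional features (stand-ins for, e.g., mean pitch and mean RMS energy). The feature values and class centers are invented for illustration only; a real pipeline would train a stronger classifier on the librosa features extracted above.

```python
import numpy as np

# Synthetic feature matrix: 30 samples per emotion around invented class centers
rng = np.random.default_rng(0)
emotions = ["ANG", "HAP", "SAD", "NEU"]
centers = {"ANG": [0.8, 0.9], "HAP": [0.7, 0.6], "SAD": [0.2, 0.2], "NEU": [0.4, 0.4]}
X = np.vstack([rng.normal(centers[e], 0.05, size=(30, 2)) for e in emotions])
y = np.repeat(emotions, 30)

# "Training": store the mean feature vector (centroid) of each emotion
centroids = {e: X[y == e].mean(axis=0) for e in emotions}

def predict(sample):
    """Assign the emotion whose centroid is closest to the sample."""
    return min(centroids, key=lambda e: np.linalg.norm(sample - centroids[e]))

print(predict(np.array([0.82, 0.88])))  # high energy & pitch: likely "ANG"
```

Even this toy model shows the core idea of the project: once intensity- and frequency-based features separate the emotions, an unseen customer recording can be mapped to anger, happiness, sadness, or neutrality.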